Reading the text

The first step is to load the text. For this example we will load a work from Project Gutenberg.



In [18]:

    
fileName = 'book.txt'

Next we discard everything that does not count as valid text. To do that, we define a function that lowercases each line and removes every character we do not want to count, keeping only letters, digits and spaces.



In [19]:

    
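import re

def removePunctuation(text):
    """Trim and lowercase a line, then drop every character that is not
    a lowercase ASCII letter, a digit, or a space."""
    # '|' is a literal character inside [...], so it is not part of the pattern.
    return re.sub('[^a-z 0-9]', '', text.strip().lower())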
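A quick local check makes the behaviour of the cleaning step easier to see. The sketch below is only an illustration: `clean` and `sample` are hypothetical names, and the pattern is passed in as a parameter so that two variants can be compared. Inside a character class `|` is a literal pipe rather than an "or", so a pattern written as `[^a-z| |0-9]` keeps pipe characters in the text, while `[^a-z 0-9]` removes them.

```python
import re

def clean(text, pattern='[^a-z 0-9]'):
    # Same steps as the cleaning function: trim, lowercase, then delete
    # every character matched by the negated character class.
    return re.sub(pattern, '', text.strip().lower())

sample = '  Hello, World! | 42  '  # hypothetical input line

without_pipe = clean(sample)               # pipe is removed
with_pipe = clean(sample, '[^a-z| |0-9]')  # pipe survives: '|' is literal in [...]

print(repr(without_pipe))  # 'hello world  42'
print(repr(with_pipe))     # 'hello world | 42'
```

Note that removing a character never closes the gap around it, which is why two consecutive spaces remain where the pipe used to be.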

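Now we create the first RDD from the contents of the book: the file is read line by line into 8 partitions and every line is passed through removePunctuation.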



In [21]:

    
shakespeareRDD = (sc
                  .textFile(fileName, 8)
                  .map(removePunctuation))



In [22]:

    
shakespeareRDD.take(4)



    Out[22]:

[u'the project gutenberg ebook of anecdotes of animals by unknown',
 u'',
 u'this ebook is for the use of anyone anywhere at no cost and with',
 u'almost no restrictions whatsoever  you may copy it give it away or']



In [23]:

    
# Number every line and print the first 15 as 'lineNum: line'.
print('\n'.join(shakespeareRDD
                .zipWithIndex()  # to (line, lineNum)
                .map(lambda pair: '{0}: {1}'.format(pair[1], pair[0]))  # to 'lineNum: line'
                .take(15)))



0: the project gutenberg ebook of anecdotes of animals by unknown
1: 
2: this ebook is for the use of anyone anywhere at no cost and with
3: almost no restrictions whatsoever  you may copy it give it away or
4: reuse it under the terms of the project gutenberg license included
5: with this ebook or online at wwwgutenbergorg
6: 
7: 
8: title anecdotes of animals
9: 
10: author unknown
11: 
12: illustrator percy j billinghvrst
13: 
14: release date may 11 2008 ebook 25428
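The order of the pair is easy to get wrong: `zipWithIndex` puts the element first and its index second, the reverse of Python's built-in `enumerate`. The sketch below mimics that step on a plain list, without Spark; `zip_with_index` and `sample_lines` are hypothetical names used only for illustration.

```python
def zip_with_index(items):
    """Local stand-in for RDD.zipWithIndex: returns (element, index) pairs."""
    return [(item, idx) for idx, item in enumerate(items)]

# Hypothetical sample, already cleaned like the lines of the book.
sample_lines = ['title anecdotes of animals', '', 'author unknown']

# Unpack as (line, num), then format as 'num: line'.
numbered = ['{0}: {1}'.format(num, line)
            for line, num in zip_with_index(sample_lines)]

print('\n'.join(numbered))
```

Blank lines keep their own index, which matches the empty entries visible in the numbered listing above.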



In [25]:

    
# Split every line on single spaces and flatten the results into one RDD of words.
shakespeareWordRDD = shakespeareRDD.flatMap(lambda x: x.split(' '))



In [26]:

    
shakespeareWordCount = shakespeareWordRDD.count()



In [27]:

    
print(shakespeareWordCount)



In [28]:

    
print(shakespeareWordRDD.take(10))



[u'the', u'project', u'gutenberg', u'ebook', u'of', u'anecdotes', u'of', u'animals', u'by', u'unknown']
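One detail of the word split is worth checking locally: `split(' ')` cuts at every single space, so a double space or a blank line yields empty strings, whereas `split()` with no argument cuts on runs of whitespace and never returns empty tokens. The following sketch, which uses one of the cleaned lines shown earlier and otherwise hypothetical variable names, compares the two.

```python
# A cleaned line from the book; note the double space after 'whatsoever'.
line = 'almost no restrictions whatsoever  you may copy it give it away or'

single_space = line.split(' ')  # one empty token where the double space is
whitespace = line.split()       # no empty tokens

print(len(single_space), single_space.count(''))
print(len(whitespace), whitespace.count(''))

# A blank line behaves differently as well.
print(''.split(' '))  # ['']  one empty "word"
print(''.split())     # []    nothing at all
```

With `split(' ')` every blank line therefore contributes one empty "word" to the RDD and to the total word count.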

Next we find out which word appears most often in the text. We build a ranking of the 10 most frequent words, which shows some of the power of Spark.



In [32]:

    
ranking = (shakespeareWordRDD
           .map(lambda x: (x, 1))  # to (word, 1)
           .reduceByKey(lambda y, z: y + z)  # to (word, count)
           .takeOrdered(10, lambda x: -x[1]))  # 10 highest counts



In [34]:

    
print(ranking)



[(u'', 5818), (u'the', 1820), (u'of', 746), (u'and', 708), (u'to', 707), (u'a', 687), (u'in', 372), (u'he', 332), (u'his', 328), (u'was', 327)]
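The first entry of the ranking is the empty string, not a word: it comes from the empty tokens that `split(' ')` produces for blank lines and repeated spaces. A minimal sketch of the same counting logic with and without those tokens is shown below. It runs on a plain Python list, with `collections.Counter` standing in for the `map` and `reduceByKey` steps and `sorted` standing in for `takeOrdered`; `lines` and `top_n` are hypothetical names and the sample data is made up for illustration. On the RDD itself, the analogous change would be a `filter` that drops empty strings before the `map`.

```python
from collections import Counter

def top_n(words, n):
    """Count words and return the n most frequent (ties broken alphabetically)."""
    counts = Counter(words)
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:n]

# Hypothetical cleaned lines: one blank line and one double space.
lines = ['the cat and the dog', '', 'the  bird']

# Local analogue of flatMap(lambda x: x.split(' ')).
words = [w for line in lines for w in line.split(' ')]

ranking_raw = top_n(words, 3)
ranking_filtered = top_n([w for w in words if w != ''], 3)

print(ranking_raw)       # [('the', 3), ('', 2), ('and', 1)]
print(ranking_filtered)  # [('the', 3), ('and', 1), ('bird', 1)]
```

Dropping the empty tokens also lowers the total word count, so any figure computed before the filter is not directly comparable with one computed after it.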